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Abstract 

Array comparative genomic hybridization(CGH) is a high resolution technique to assess DNA 

a ' 
^__» ■ 

copy number variation. Identifying breakpoints where copy number changes will enhance the under- 
standing of the pathogenesis of human diseases, such as cancers. However, the biological variation 
" and experimental errors contained in array CGH data may lead to false positive identification of 

\o'. 

7— ( ■ breakpoints. We propose a robust state space model for array CGH data analysis. The model con- 
sists of two equations: an observation equation and a state equation, in which both the measurement 
^ \ error and evolution error are specified to follow t-distributions with small degrees of freedom. The 
^ " completely unspecified CGH profiles are estimated by a Markov Chain Monte Carlo (MCMC) algo- 

• rH ■ 

^ • rithm. Breakpoints and outliers are identified by a novel backward selection procedure based on 
H ! 

posterior draws of the CGH profiles. Compared to three other popular methods, our method demon- 
strates several desired features, including false positive rate control, robustness against outliers, and 
superior power of breakpoint detection. All these properties are illustrated using simulated and real 
datasets. 
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1 Introduction 



Almost all types of cancer share one common characteristic, genetic instability, including DNA copy 
number variation(CNV). During cancer progression some genes will lose one of the two copies or are 
completely deleted, while others may gain one copy, or become amplified up to hundreds of copies. 
These chromosomal alterations can lead to abnormal cell proliferation, DNA repair, senescence 
and apoptotic mechanisms and can provide a selective advantage for cells and result in cancer. 
Identification of CNV not only enhances the understanding of oncogenesis but also facilitates the 
treatment of cancer. For example, Trastuzumab is a monoclonal antibody interfering with ERBB2 
receptor and i s used for the trea tment of breast cancers with amplified, and multiple copies of the 



ERBB2 gene (jVogel et al. 



20021 ). 



Array comparative genomi c hybridization (CGH) is a technique that is used to detect differ- 



ences in DNA copy number fjSolinas-Toldo et al 



1997 



Pinkel et al 



19981 ). The isolated DNA 



from tumor and the normal tissue from each patient are labeled with different fluorescent dyes 
and then cohybridized to the microarray. The log 2 fluorescent intensity ratios are measured at 
different chromosomal positions to define each CGH profile. This CGH profile is supposed to be 
proportional to the copy numb er ratio for t u mor and normal cells across the chromosome. See 
Pinkel fc Albertson ( 2005 ) and Davies et all ( 2005 ) for detail reviews. Array CGH data exhibit 
three challenging characteristics. First, the data displays abrupt changes at the positions where 
DNA copy number is possibly altered. Second, the data usually contain biological variations and 
experimental errors, which hinder the accurate identification of breakpoints where copy number 
changes. Biological variations refer to heterogeneity of copy number within tumor cells and ex- 
perimental errors include contamination of the tumor cells with normal cells, measurement errors 
and errors caused by processing tissue samples. Third, the data are spatially dependent. That is, 
neighboring genes are more likely to share the same copy number than remote ones. The primary 
aim of array CGH data analysis is to estimate the CGH profiles and to identify breakpoints from 
available noisy observations. 

A number of statistical methods have been proposed for array CGH data analysis. Most of the 
methods postulate that the observed log 2 intensity ratio Y(tj) is governed by the following model, 

Y(t j ) = »(t j ) + e(t j ), j = 1,2,..., J (1) 

where signal fi(tj) is the true log 2 intensity ratio at jth probe, e(tj) is noise and tj denotes the 



physical position of jth probe on a chromosome. Different assumptions and interpretations of n(tj) 
and s(tj) lead to various estimation approaches, which may be categorized into three types. The 
first type is based on the segmentation method. It assumes that the CGH profile fi(t), is piecewise 
constant, i.e. = J2m=i ^ml\tj € T m ], where T m is segment m with mean \i m and /(■) is the 

indicator function. Also e(tj) follows independent and identically distribut ed (i.i.d.) A^ f O, a 2 ^ 



To detect breakpoints that enable us to classify chromosome into blocks, 



and 



Qlshen et al 



(120041 ) 



Venkatraman fc Qlshen! (120071 ) proposed the method of circular binary segmentationfCBS) ; 



Hupc 



et al\ (120041 ) developed the adaptive weighted smoothing procedure; and 



Erdman fc Emerson 



( 120081 ) implemented a Bayesian change point model. The second type is the method of hidden 



Markov models (HMM), which restricts fi(t) to take a finite number of values and uses a Markov 
chain to model probabilities: Pr(//(tj + i) = | fi(tj) = /i m ), [i m , fi m ' £ with state space 
U — {Hm', Tn = 1, 2, ... , M}. Note that M is a prespecified number o f states. The HMM method 



was first applied to array CGH data analysis by 



Fridlyand et al. 



(120041 ) . 



Shah et al. 



(12006f l modified 



the HMM method to achiev e robustness against outliers. A continuous-index HMM was developed 



by IStjernqvist et al\ ( 120071 ). 



Guha et al. 



(120081 ) derived a Bayesian approach to the HMM with 



Andersson et al. 



object ive decision rules. A segmental maximum posteriori approach(SMAP) by 
( 120081 ) has incorporated both genomic distance and overlap between clones into the HMM. Finally, 
the third type is built upon penalization methods, which essentially relax the piecewise constant 
assumption by imposing a roughness penalty on CGH profile /i(t). In a penalization method, we 
consider minimizing an objective function of the form Q = Q g f + Q sp , where the first term Q g f 
measures the goodness of fit for profile jj,{t) to the observed process Y{t) at observed probes t'jS, 
and the second term Q sp regularizes the smoothness of nit). Various forms of Q g j and Q Kp have 
been proposed in th e literature, including quant ile smooth i ng (lEilers fc De Menezesl . 120051 ) . LASSO 



( Huang et al 



Tibshirani &: Wang 



2005) fuse d quantile regression ( ILi fc Zhul . 120071 ) and fused LASSO (FLASSO) by 



(120081 ) . 



Be sides the three types of met 



rithm (IWang et al 



( Ivan de Wiel et al. 



2005 



rods, there are other appr oaches; for 



Liu et al. 



20061 1 , wavelet transform (IHsu et all 120051 ) and ridge regression 



exam ple, clustering algo- 



20091 ). among many others. Co mpre hensive comparisons among som e of afore- 



mentioned methods were given by 



Lai et al. 



Willenbrock fc Fridlyand! ( 120051 ). Some of 



(120051 ) and 

the methods only estimate the profiles but do not directly call the breakpoints. Further, most 
methods do not control the false positive rate for breakpoint identification, and their performances 
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are significantly effected by the experimental errors, such as outliers. 

In this paper, we propose a new method based on robust state space models for array CGH 
data to estimate the CGH profile and to identify breakpoints under controlled false positive rates. 
In addition, this new method has a number of desirable properties: (1) it is robust against outliers; 
(2) it incorporates physical distance between probes into CNV identification; (3) it enables us to 
quantify estimation uncertainties of signals via posterior credible intervals; (4) all the parameters are 
estimated as part of the MCMC algorithm and thus are highly data-adaptive; (5) the computational 
efficiency of the MCMC algorithm for profile estimation is proportional to the number of probes, 
which helps the computation speed for high-throughput array CGH data analysis. 

The rest of the paper is organized as follows. In Section 2, we first present the robust state 
space model, then describe an MCMC algorithm to draw samples of both profiles and parameters, 
and also outline a novel procedure of calling the breakpoints and outliers using MCMC samples. 
In Section 3, the proposed model and method are applied to both simulated and real datasets for 
illustration, where we compare our new method to three popular existing methods. We finally give 
conclusions and discussion in Section 4. 

2 Methods 
2.1 Model 

For the ease of exposition, we denote Yj = Y(tj), \ij = fi(tj) and Sj = e(tj). The proposed 
robust state space model(RSSM) comprises two equations: an observation equation and a state 
equation. The observation equation is given in equation (IT]), where error Sj is assumed to be 
i.i.d and follow t-distribution, % e , with degree of freedom(d.f.) v £ . Note that t-distribution is a 
scale-mixture of normal distribution and gamma distribution. Thus, we rewrite Ej ~ A/"(0, •) a 
normal distribution with mean and variance of j, and let <t~| = X £ jt £ and \ £ j ~ Q(v £ /2,v £ /2) a 
gamma distribution with shape parameter v £ /2 and rate parameter v £ /2. The priors are specified 
as v £ ~ £(1(T 3 , l(T 3 )/(2, 10) and r £ ~ £(1(T 3 , 1(T 3 ) throughout the paper. 

We regard the signal f-i(tj) as a continuous quantity which measures the log 2 of average copy 
number of heterogeneous tumor cells versus homogeneous normal cells. The state equation is: 

/'/• . I'j ■ s,- (2) 
4 



where the evolution error or signal difference £j follows an i.i.d t-distribution with d.f v^. Sim- 



ilar to the specification of Ej, we let ~ J\f(0,a^jSj), 5j = — tj, cr,J = ^t,j T t an d 
g(v $ /2,v^/2), with the priors v i ~ £(10~ 3 , 10- 3 )/(0.01, 2) and r 5 ~ ^(lO^ 3 , 10" 3 ). As a result 
e,- ~ 7^.(0, rr 1 ) and £, ~ 7^,(0, (5 ,-Tr 1 ) marginally. Unlike other robust state space models (IWest 



1984 



Fahrmeir fc Kiinstler 



1999), our model incorporates the physical distance 5j between two 



probes to address the feature that the farther two probes are apart, the larger the signal difference 
£j is likely to be. Note that degree of freedom v% is limited below 2. In this way, we hope that the 
distribution % can accommodate extremely la rge values of sign al difference probably caused by 
breakpoints. A similar strategy was suggest by iKitagawal ( 119871 ). where differences of signals are 
modeled by a distribution in the Pearson system with no finite second moments. As shown in his 
paper, the Pearson system distribution facilitates the detection of mean structure changes. 



2.2 Signal extraction by MCMC 

With the model formulation given in Section 2.1, we now outline an MCMC algorithm to sample 
from the posterior distribution for signals n = [//i,/i 2 , • • • ,^j] T , parameters <p = [X e j,v E ,r e ] and 
<p s = [X u , V£, r 5 ] for j = 1, 2, . . . , J, given the data Y = [Y l7 Y 2 , . . . , Yj}. 



Given Y, <fi and <f> a , update the \x by the simulation smoother (IDurbin fc Koopmanl . 120021 ). a 
multi-state Gibbs sampler which very efficiently draws samples from the posterior distribution 
of signals /x. 

Given Y and /j,, update (p according to the following steps: 



[A £>i |-]~S(f + |, f 



1 V e j_ (Yj-HjfTe 



Hj=i I ^ 



Sampling (ARMS; 



Gilks et al 



G(v, I 10" 3 , 10- 3 )/(2, 10), by Adaptive Metropolis Rejection 



19951 ) 



T £ ~g(j + io- 3 ,e=i 



+ 10 



-3\ 



Given /x, update 4> s through the following steps: 



YlU G ^ I f ' f )$( v t I 10 ~ 3 > 10- 3 )/(0.01, 2), by the ARMS; 



r,~s(! + nr 3 ,£ =1 



-3 V^^ (Mj+i-Mj) 2 ^, 



2,) r 



10 



-31 



According to the definition of errors Ej = Yj — fij and £j = — fij, we obtain the posterior draws 
of the errors e = [si, s 2 , ■ ■ ■ , £j] T and the signal differences £ = [£ 2 , £3, • • • , 0] T - Samples of e and 
£ are essential to identify outliers and breakpoints through a novel backward selection procedure 
detailed in Section 2.3 below. 



2.3 Breakpoints and outliers calling 

Breakpoints are called by our backward selection procedure outlined in Algorithm [1] given in the 
Supplementary Material. The input to the algorithm is the posterior draws of signal differences £, 
in an m x n matrix, with m denoting the number of draws and n equal to the number of probes 
minus one, as well as an input of a threshold q a . The specification of q a is discussed in detail 
below. At line in Algorithm [TJ we calculate Pj, which is an estimate of the posterior probability 
I > I "¥]. This is the probability of the absolute value of signal difference at position j is 
larger than those at any other positions, given the dat a. The quantity P[|£j| > \ Y] represents 



the area under the ROC curve or AUC (jPepel . 



2004 



Ch.4). It is known that AUC measures the 



separation between the posterior distribution of and that of the remaining namely all |^| 
with i j. Under the null hypothesis that probe j is not a breakpoint, we expect Pj to be near 
0.5. The decision of rejection of the null hypothesis will be based on the comparison of Pj with 
the threshold q a . In the first iteration of procedure, several Pjs may be larger than q a ; we take 
the largest one and call it a breakpoint. This called position will be excluded from the subsequent 
iterations, we repeat his calling procedure for the remaining £j until none of the remaining Pj is 
above the threshold q a or all n — 1 breakpoints are selected. The output of the algorithm is a list of 
identified breakpoints. Likewise, we utilize this backward selection procedure to call outliers, based 
on the posterior draws of errors e. 

In the above backward selection procedure for the calling of breakpoints, the q a is determined 
such that under the null hypothesis that probe i is not a breakpoint, it will be chosen with probability 
a (i.e. a is false positive rate). When a normal reference array is available, we can measure log 2 
intensity ratio of normal versus normal tissue. Fitting the proposed state space model to the normal 
reference array, we obtain the posterior draws of signal differences £°, where we can obtain P? s 
according to Algorithm [TJ These P°, j — 1, 2, ... , J°, can be regarded as a random sample from a 
distribution under the null hypothesis. Then, the q a is obtained as the (1 — a) quantile of all P°'s. 
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This quantile q a implies that under the null hypothesis, the rate of false positive is a. In some real 
experiments, normal reference arrays however may not be available. In this case, we can generate 
a pseudo normal reference array Y° = [Y°, Y£, . . . , Yj a ] by sampling with replacement from the 
data Y. In this case, if some Y-s in the aberration region are sampled, they will be dispersed and 
scattered randomly within the set Y°. Thus, they appear most likely as outliers rather than a 
contiguous pattern of changes. Since the proposed state space model is robust against outliers, the 
q a under the null hypothesis can be reasonably determined. Given the pseudo normal array, the 
steps to obtain the q a are the same as those given in the scenario of the normal reference array 
being available. 



3 Applications 



3.1 Simulation study 



We first evaluate our proposed method and compare it with three other 



popular method s, FLASSO 



Lai et al 



fl2005f ). Lai et al.'s 



CBS and SMAP, using well known artificial chromosomes simulated by 
data consist of 100 chromosomes, each with length 100. In the center of each chromosome is added 
an aberration of copy number gain, which has one of the four different width (5,10,20 and 40). The 
signal-to-noise ratio(SNR) is 1, and the noise follows a normal distribution with standard deviation 
0.25. 

We use the Receiver Operating Characteristic(ROC) curve to compare the performance of the 
four methods in each width case. To obtain ROC curves, we compare the estimated signal fij at each 
location with a cutoff varying from the minimum to the maximum of Y, and regard the location 
i where /2j is above the cutoff as the detected aberration region. The true positive rate(TPR) is 
defined as the proportion of the true aberration region detected as an aberration region, while the 
false positive rate(FPR) is defined as the proportion of the normal region declared as an aberration 
region. The TPRs and FPRs are plotted as ROC curves in Figure [H For the Lai et al.'s data, the 
plots at the first row in Figure [1] indicate that our approach performs clearly better than CBS and 
SMAP methods, in terms of higher TPR and lower FPR, not as well as FLASSO for the narrow 
regions but comparably to FLASSO for th e wide aberrations(20 and 40). 



The simulated data in 



Lai et al 



(120051 ) is idealized, and does not contain any of the complex 
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features that occur in real data. Outliers are commonly seen in real datasets for various reasons, 
including single probe amplification/deletion or experimental errors. To investigate the effect of 
outliers, in Lai et a/.'s simulated dataset, we add five percent of outliers in each chromosome at 
randomly selected positions with magnitudes uniformly distributed over interval (3,6). The ROC 
curves given at the second row in Figure [T] clearly show the advantage of the proposed method. 
Comparing to the corresponding cases in the first row, the ROC curves of FLASSO, CBS and SMAP 
are considerably closer to the diagonal line, demonstrating a significant loss of prediction power for 
the detection of CNVs. In contrast, the ROC curve of the proposed approach is affected very little, 
indicating clearly that our method is robust to outliers. 

Another feature of the real data is the possibility of more than one region of aberration with 
different magnitudes. To evaluate the performance of the methods, we explore cases when two 
aberration regions are present in the simulated chromosome simultaneously. For each Lai et a/.'s 
simulated chromosome, a randomly selected normal region of width five is replaced by an aberration 
block with SNR 4. Based on the ROC curves plotted in the third row of Figure HJ the proposed 
approach outperforms the three other approaches. 

An important task in array CGH analysis is to correctly identify breakpoints. We investigate the 
number of breakpoints identified by the four methods for each chromosome in the above simulated 
data. In addition, we simulated normal chromosomes without any aberration regions. For these we 
generate 100 normal chromosomes, each with 100 probes simulated from A/"(0, 0.25 2 ). In addition, 
another 100 chromosomes are generated by adding outliers to the 100 normal chromosomes, in the 
same way described above. For FLASSO, CBS and SMAP, a breakpoint is defined as a position j, 
if the difference is non-zero, that is, Ajij = /L, +1 — fij ^ 0. For the proposed method, a breakpoint 
is called by the backward selection procedure as described in Algorithm [TJ To determine the q a , we 
simulate a normal reference array with each probe as J\f(0, 0.25 2 ) with length J° = 1000 and generate 
the pseudo normal reference arrays with length J° = 1000 through sampling with replacement from 
the artificial chromosomes. The false positive rate a is set at 0.001, which means that for every 
1000 probes in the normal reference array, one probe is expected to be falsely called as a breakpoint. 
Figure [2] shows the side-by-side boxplots of the number of breakpoints identified by each of four 
method respectively, where the q a is determined with simulated normal reference arrays and pseudo 
normal reference arrays, respectively, for RSSM0 and RSSM1 corresponding to the first two boxplots 
in each panel. From a comparison of these boxplots, it is clear that the number of breakpoints is 

8 



over-estimated substantially by FLASSO in all the three scenarios although the magnitude of the 
signal difference at some of these breakpoints may by quite small. The true number of breakpoints, 
on average, is more likely to be correctly achieved by the proposed method, in scenario of two pieces 
of aberration regions or in the cases where the aberration widths are as wide as 20 and 40. For the 
normal chromosomes with or without outliers, both CBS and our method correctly conclude that 
there are no breakpoints, while FLASSO identifies a few number of false breakpoints. Note that our 
method identifies a total of 6 and 13 breakpoints for 10,000 probes in 100 normal chromosomes by 
using, respectively, simulated and pseudo normal reference arrays. These number of false discoveries 
numbers are close to the expected number 10, given the false positive rate 0.001. We also notice 
that the numbers of breakpoints identified in the simulated and pseudo normal reference arrays are 
very close to each other, which validates the utility of pseudo normal reference arrays when the 
normal reference arrays are not available. 

[Figure 1 about here.] 

[Figure 2 about here.] 



3.2 Glioblastoma Multiforme(GBM) data 



GBM data by 



Bredel et al. 



(120051 ) include 26 samples representing primary GBMs, the most malig- 



nant type of brain tumor. In sample GBM31 , a larg e region of loss is demonstrated on chromosome 



13, which is also observed by 



Koschny et al. 



(120021 ) in a meta-analysis of 509 cases. Besides losses, 
the GBM data als o contain a num ber of amplifications, one of which is shown on chromosome 7 

( 120051 ) compared the performance of various methods based on these 



in sample GBM29. 



Lai et al. 



two chromosomes 13 and 7 with challenging features. They represent wider, low level region of loss, 
and narrower, high level region of amplification, respectively. To assess our proposed method, we 
re-analyze these two chromosomes using our method. The analysis is based on 1000 MCMC draws 
from a single chain of 75, 000 iterations with 25, 000 burn- in period and every 50th being recorded. 
As shown in Figure [3], our method successfully detects both the loss region and amplification region 
as well as some outliers. Both breakpoints and outliers are called using the proposed backward se- 
lection procedure. The threshold for breakpoints is obtained through the pseudo normal reference 



arrays with 



001 



0.911 for chomosome 7 and q$ 



001 



0.882 for chomosome 13. The threshold 
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for outliers is chosen as q e = 0.98. The panels in Figure |3] also illustrate posterior means and 95% 
credible intervals for signal fij, error Ej and signal differences £j across the chromosomes. At a given 
position, the wider interval indicates higher uncertainty. Note that 95% credible intervals of signal 
difference illustrate the corresponding posterior distributions. The further the credible interval de- 
parts from the others along with the narrower width, the stronger it indicates the corresponding 
position is a breakpoint. We also analyze the GBM data using the methods of FLASSO, CBS and 
SMAP. As Figure H] shown, all three methods can identify the two aberration regions, except SMAP 
method that fails to detect any aberration region for chromosome 13. 

Tabled] lists the number of breakpoints identified by each of the four methods. Our method and 
CBS reach the same numbers on both chromosomes, which are much less than the those found by 
FLASSO. 

[Figure 3 about here.] 
[Figure 4 about here.] 
[Table 1 about here.] 



3.3 Breast tumor data 



Fridlyand et al\ (j2006f ) considered array CGH data from across 2464 genomic clones in 62 sporadic 
ductal invasive breast tumors and 5 BRCAl mutant tumors. We apply our method as well as other 
three methods to analyze four chromosomes(8,ll,17 and 20) of tumor "S1539", in which there are 
a number of low level gains and losses as well as high level amplifications. The results of our 
method are based on 1000 MCMC draws from a single chain of 75, 000 iterations with 25, 000 burn- 
in period and every 50th being recorded. The backward selection procedure has been applied to 
identify a number of breakpoints and outliers/amplifications. The q e is specified as 0.98, and Qo.ooi 
is determined using the pseudo normal reference arrays, resulting in values of 0.795,0.789,0.808 
and 0.807 for chromosome 8,11,17 and 20 respectively. Figure displays the posterior means and 
95% credible intervals of signal fij, error Ej and signal differences £j across the chromosomes, as well 
as a number of called outliers and breakpoints. These breakpoints define the edges of aberration 
regions which include several well-known oncogenes, that play key roles in the pathogenesis of breast 
tumor. The detected regions cover gene FGFR I between 36.4Mb and 39.7Mb on chromosome 8, 
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gene CCND I between 68.5Mb and 77.0Mb on chromosome 11, and gene ZNF217 between 44.4Mb 
and 62.7Mb on chromosome 20. Gene ERBB2 between 34.1Mb and 38.7Mb on chromosome 17 is a 
well known gene that can be amplified in breast cancer. There are very few probes close to ERBB2, 
and the method detected a probe as an outlier in this region. 

We also analyze the same breast tumor data by FLASSO, CBS and SMAP methods. The 
results are shown in Figure M We can see that the SMAP method appears to be very sensitive to 
outliers(e.g. in chromosome 11) and local features(e.g. in chromosome 20), which has obscured the 
estimate of the global trend. The CBS method failed to capture the single probe amplification in 
the chromosome 17 and the weak gain in chromosome 20. The FLASSO method is also sensitive to 
outliers, e.g. at the beginning of chromosome 8 and in the middle of chromosome 11. The number 
of breakpoints identified by each of the methods is summarized in Table [TJ FLASSO identifies a 
large number of breakpoints, our method identifies slightly more breakpoints than CBS and slightly 
fewer than SMAP. 

[Figure 5 about here.] 
[Figure 6 about here.] 

4 Discussion 

In this paper, we have proposed a powerful new method based on a robust state space model to 
detect CNVs from array CGH data. A key feature of the proposed method is the use of heavy tail 
t-distributions, which facilitates the robustness in the calling of breakpoints and outliers. Through 
an MCMC algorithm, our approach presents an appealing method for CGH profile estimation 
and detection of breakpoints. Our method is based on a probability model that gives not only 
point estimation, but also uncertainty intervals for the signal, signal difference and measurement 
error magnitudes, as illustrated in Figure [3] and |5j Such displays are very useful for visualizing 
the data and the degree of confidence in any conclusion. We developed a novel backward selec- 
tion procedure to effectively utilize the MCMC samples in the identification of breakpoints and 
outliers/amplifications. Importantly, we control the false positive rate of feature detection at a 
prespecified level by using real or pseudo normal reference arrays. As illustrated by both simulated 
and real datasets, our approach has demonstrated superior detection power for aberration regions 
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and breakpoints, and outperforms other existing methods in most of cases, especially for noisy data 
with outliers. 
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Supplementary Material 



Algorithm 1 Backward selection procedure for the breakpoints 
Input: M mxn , q a 

1. J <- and flag <- true 

2. repeat 

3. for j = 1 to n and j ^ J do 

4. V_j m samples without replacement from columns X of M, X = {i : % ^ j and % ^ J"} 

5. Vj <— column j of M 

e. ^-^Er=iE" = iAi^#]i>i^[^]i) 

7. end for 

8. if 3j G {1, 2, . . . , n} and j ^ J : Pj > q a then 

9. ./ • ./ : Pj > I'y all ./' / ./ 

10. J<-J\J{j} 

11. else 

12. flag <- false 

13. end if 

14. until flag = false or number of elements in J = n — 1 

15. if number of elements in J = n — 1 then 

16. J<-{l,2,...,n} 

17. end if 
Output: J 
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Figure 1: ROC curves of four methods at SNR 1. - Our model, FLASSO, — • — CBS, • • • 

SMAP. 
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Figure 2: Breakpoints identification using simulated and pseudo normal reference arrays. The hori- 
zontal reference lines indicate the true number of breakpoints. The simulated normal reference array 
is labeled by RSSM0, while RSSM1 utilizes pseudo normal reference arrays based on resampling 
the observed data. The panel on bottom left is the replicate of the one on top left. 
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Figure 3: GBM panel plots for the posterior distributions of measurement error, signal, and signal 
difference by state space model. In the top and bottom panels, the • denotes the posterior mean 
and | stands for the 95% credible intervals. In the middle panel, gray • is the data point and — is 
posterior mean and 95% credible intervals are the shaded areas. ♦ denotes the selected outliers and 
breakpoints. 
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Figure 4: Panel plots of signal( — ) estimated for GBM data by FLASSO, CBS and SMAP, where 
gray • denotes the data point. 
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Figure 6: Panel plots of signal( — ) estimated for breast tumor data by FLASSO, CBS and SMAP, 
where gray • denotes the data point. 
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Table 1: The number of breakpoints identified in GBM and Breast Tumor data 



GBM data Breast Tumor data 
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